Recent research in clustering face embeddings has found that unsupervised, shallow, heuristic-based methods -- including $k$-means and hierarchical agglomerative clustering -- underperform supervised, deep, inductive methods. While the reported improvements are indeed impressive, experiments are mostly limited to face datasets, where the clustered embeddings are highly discriminative or well-separated by class (Recall@1 above 90% and often nearing ceiling), and the experimental methodology seemingly favors the deep methods. We conduct a large-scale empirical study of 17 clustering methods across three datasets and obtain several robust findings. Notably, deep methods are surprisingly fragile for embeddings with more uncertainty, where they match or even perform worse than shallow, heuristic-based methods. When embeddings are highly discriminative, deep methods do outperform the baselines, consistent with past results, but the margin between methods is much smaller than previously reported. We believe our benchmarks broaden the scope of supervised clustering methods beyond the face domain and can serve as a foundation on which these methods could be improved. To enable reproducibility, we include all necessary details in the appendices, and plan to release the code.
Translated by Google Translate
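The Recall@1 figure quoted in the clustering abstract above is not defined there. As a minimal sketch, assuming the common retrieval definition (the fraction of points whose nearest other point shares their class label) and Euclidean distance, it can be computed as follows. The function name `recall_at_1` and the toy data are hypothetical and are not taken from the paper.

```python
import math


def recall_at_1(embeddings, labels):
    """Fraction of points whose nearest *other* point has the same label.

    Assumes Euclidean distance and at least two points; the paper may use
    a different metric or evaluation protocol.
    """
    hits = 0
    for i, query in enumerate(embeddings):
        best_j, best_dist = None, math.inf
        for j, other in enumerate(embeddings):
            if i == j:
                continue  # a point must not retrieve itself
            d = math.dist(query, other)
            if d < best_dist:
                best_j, best_dist = j, d
        hits += labels[best_j] == labels[i]
    return hits / len(embeddings)


# Toy data (hypothetical): the last "b" point sits next to the "a" cluster,
# so its nearest neighbour has the wrong label.
embeddings = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.2, 0.1)]
labels = ["a", "a", "b", "b", "b"]
score = recall_at_1(embeddings, labels)
```

On this toy set, four of the five points retrieve a same-class neighbour, so a value above 90% in the sense used above would correspond to embeddings that are much better separated by class than these.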
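Recent work has argued that classification losses utilizing softmax cross-entropy are superior not only for fixed-set classification tasks, but also outperform losses developed specifically for open-set tasks, including few-shot learning and retrieval. Softmax classifiers have been studied using different embedding geometries (Euclidean, hyperbolic, and spherical), and claims have been made about the superiority of one or another, but they have not been systematically compared under careful controls. We conduct an empirical investigation of embedding geometry on softmax losses for a variety of fixed-set classification and image retrieval tasks. An interesting property observed for the spherical losses leads us to propose a probabilistic classifier based on the von Mises-Fisher distribution, and we show that it is competitive with state-of-the-art methods while producing improved out-of-the-box calibration. We provide guidance regarding the trade-offs between losses and how to choose among them.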
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
Autonomous vehicles are being deployed with a spectrum of capability, extending from driver assistance features for the highway in personal vehicles (SAE Level 2+) to fully autonomous fleet ride sharing services operating in complex city environments (SAE Level 4+). Systems across this spectrum of autonomy often operate in different physical environments with different degrees of assumed driver in-the-loop oversight, and hence have very different system and subsystem requirements. At the heart of SAE Level 2 to 5 systems is localization and mapping, which ranges from road determination for feature geofencing or high-level routing, through lane determination for advanced driver assistance, to where-in-lane positioning for full vehicle control. We assess localization and mapping requirements for different levels of autonomy and supported features. This work provides a framework for system decomposition, including the level of redundancy needed to achieve the target level of safety. We examine several representative autonomous and assistance features and make recommendations on positioning requirements as well as map georeferencing and information integrity.
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on how to turn it into one that can be productively studied empirically. We first present an experimental design centered on choosing tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.
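This paper reports an investigation into the problem of rapidly identifying a channel using one or more unmanned surface vehicles (USVs). A new algorithm called Proposal-Based Adaptive Channel Search (PBAC) is presented as a potential solution that improves upon current methods. The empirical performance of PBAC is compared with lawnmower surveying and with Markov decision process (MDP) planning using two state-of-the-art reward functions: Upper Confidence Bound (UCB) and Maximum Value Information (MVI). The performance of each method is evaluated by comparing the time it takes to identify a continuous channel using one, two, three, or four USVs. The methods are compared across ten simulated bathymetry scenarios and one field area, each with a different channel layout. The results from simulations and field trials indicate that, on average, multi-vehicle PBAC outperforms the lawnmower, UCB, and MVI based methods, especially when at least three vehicles are used.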
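Heating and cooling systems in buildings account for 31% of global energy use, much of which is regulated by rule-based controllers (RBCs) that neither maximise energy efficiency nor minimise emissions by interacting optimally with the grid. Control via reinforcement learning (RL) has been shown to significantly improve building energy efficiency, but existing solutions require access to building-specific simulators or data that cannot be expected for every building in the world. In response, we show that it is possible to obtain emission-reducing policies without such knowledge, a paradigm we call zero-shot building control. We combine ideas from system identification and model-based RL to create PEARL (Probabilistic Emission-Abating Reinforcement Learning) and show that a short period of active exploration is all that is required to build a performant model. In experiments across three varied building energy simulations, we show that PEARL outperforms existing RBCs and popular RL baselines in all cases, reducing building emissions by as much as 31% whilst maintaining thermal comfort. Our source code is available online via https://enjeener.io/projects/pearl.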
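We introduce AudioScopeV2, a state-of-the-art universal audio-visual on-screen sound separation system that is capable of learning to associate sounds with on-screen objects by watching in-the-wild videos. We identify several limitations of previous work on audio-visual on-screen sound separation, including the coarse resolution of spatio-temporal attention, poor convergence of the audio separation model, limited variety in training and evaluation data, and failure to account for the trade-off between preserving on-screen sounds and suppressing off-screen sounds. We provide solutions to all of these issues. Our proposed cross-modal and self-attention network architectures capture audio-visual dependencies at a finer resolution over time, and we also propose efficient separable variants that are capable of scaling to longer videos without sacrificing much performance. We also find that pre-training the separation model only on audio greatly improves results. For training and evaluation, we collected new human annotations of on-screen sounds from a large database of in-the-wild videos (YFCC100M). This new dataset is more diverse and challenging. Finally, we propose a calibration procedure that allows exact tuning of on-screen reconstruction versus off-screen suppression, which greatly simplifies comparing performance between models with different operating points. Overall, our experimental results show marked improvements in on-screen separation performance under much more general conditions than previous methods, with minimal additional computational complexity.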
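Heating and cooling systems in buildings account for 31% of global energy use, much of which is regulated by rule-based controllers (RBCs) that neither maximise energy efficiency nor minimise emissions by interacting optimally with the grid. Control via reinforcement learning (RL) has been shown to significantly improve building energy efficiency, but existing solutions require pre-training in simulators that are prohibitively expensive to obtain for every building in the world. In response, we show that it is possible to perform safe, zero-shot control of buildings by combining ideas from system identification and model-based RL. We call this combination PEARL (Probabilistic Emission-Abating Reinforcement Learning) and show that it reduces emissions without pre-training, needing only a three-hour commissioning period. In experiments across three varied building energy simulations, we show that PEARL outperforms existing RBCs and popular RL baselines in all cases, reducing building emissions by as much as 31% whilst maintaining thermal comfort.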